Scalable Planning and Learning for Multiagent POMDPs: Extended Version
Online, sample-based planning algorithms for POMDPs have shown great promise
in scaling to problems with large state spaces, but they become intractable for
large action and observation spaces. This is particularly problematic in
multiagent POMDPs, where the action and observation spaces grow exponentially
with the number of agents. To combat this intractability, we propose a novel
scalable approach based on sample-based planning and factored value functions
that exploits structure present in many multiagent settings. This approach
applies not only in the planning case, but also in the Bayesian reinforcement
learning setting. Experimental results show that we are able to provide high
quality solutions to large multiagent planning and learning problems.
Best Response Bayesian Reinforcement Learning for Multiagent Systems with State Uncertainty
It is often assumed that agents in multiagent systems with state uncertainty have full knowledge of the model of dynamics and sensors, but in many cases this is not feasible. A more realistic assumption is that agents must learn about the environment and other agents while acting. Bayesian methods for reinforcement learning are promising for this type of learning because they allow model uncertainty to be considered explicitly and offer a principled way of dealing with the exploration/exploitation tradeoff. In this paper, we propose a Bayesian RL framework for best response learning in which an agent has uncertainty over the environment and the policies of the other agents. This is a very general model that can incorporate different assumptions about the form of other policies. We seek to maximize performance and learn the appropriate models while acting in an online fashion by using sample-based planning built from powerful Monte-Carlo tree search methods. We discuss the theoretical properties of this approach, and experimental results show that the learning approaches can significantly increase value when compared to initial models and policies.
Exploiting Anonymity in Approximate Linear Programming: Scaling to Large Multiagent MDPs (Extended Version)
Many exact and approximate solution methods for Markov Decision Processes
(MDPs) attempt to exploit structure in the problem and are based on
factorization of the value function. Multiagent settings in particular, however,
are known to suffer from an exponential increase in value component sizes as
interactions become denser, meaning that approximation architectures are
restricted in the problem sizes and types they can handle. We present an
approach to mitigate this limitation for certain types of multiagent systems,
exploiting a property that can be thought of as "anonymous influence" in the
factored MDP. Anonymous influence summarizes joint variable effects efficiently
whenever the explicit representation of variable identity in the problem can be
avoided. We show how representational benefits from anonymity translate into
computational efficiencies, both for general variable elimination in a factor
graph and, in particular, for the approximate linear programming solution to
factored MDPs. The latter allows linear programming to scale to factored MDPs
that were previously unsolvable. Our results are shown for the control of a
stochastic disease process over a densely connected graph with 50 nodes and 25
agents.
Comment: Extended version of AAAI 2016 paper
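As a rough illustration of the anonymity idea described in this abstract, the following minimal sketch compares an identity-based table with a count-based one. The disease model, the function names, and the transmission parameter `beta` are hypothetical choices made for this example; they are not taken from the paper and do not reproduce its algorithm.

```python
from itertools import product


def infection_prob_by_count(num_infected, beta=0.2):
    """Hypothetical local model: each infected neighbor independently
    transmits with probability beta, so only the COUNT of infected
    neighbors matters, not which neighbors they are."""
    return 1.0 - (1.0 - beta) ** num_infected


def identity_table(k, beta=0.2):
    """Explicit representation: one entry per joint assignment of
    k binary neighbor variables (2**k entries)."""
    return {
        bits: infection_prob_by_count(sum(bits), beta)
        for bits in product((0, 1), repeat=k)
    }


def count_table(k, beta=0.2):
    """Anonymous representation: one entry per possible number of
    infected neighbors (k + 1 entries)."""
    return [infection_prob_by_count(c, beta) for c in range(k + 1)]


k = 10
full = identity_table(k)
anon = count_table(k)

# Both tables encode the same function of the neighbors' states.
assert all(full[bits] == anon[sum(bits)] for bits in full)

print(len(full), len(anon))  # 1024 11
```

Under this assumed model, the table indexed by neighbor identity grows exponentially in the number of neighbors, while the table indexed by counts grows linearly, which is the kind of representational saving the abstract attributes to anonymous influence.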
Influence-Optimistic Local Values for Multiagent Planning --- Extended Version
Recent years have seen the development of methods for multiagent planning
under uncertainty that scale to tens or even hundreds of agents. However, most
of these methods either make restrictive assumptions on the problem domain, or
provide approximate solutions without any guarantees on quality. Methods in the
former category typically build on heuristic search using upper bounds on the
value function. Unfortunately, no techniques exist to compute such upper bounds
for problems with non-factored value functions. To allow for meaningful
benchmarking through measurable quality guarantees on a very general class of
problems, this paper introduces a family of influence-optimistic upper bounds
for factored decentralized partially observable Markov decision processes
(Dec-POMDPs) that do not have factored value functions. Intuitively, we derive
bounds on very large multiagent planning problems by subdividing them into
sub-problems, and at each of these sub-problems making optimistic assumptions
with respect to the influence that will be exerted by the rest of the system.
We numerically compare the different upper bounds and demonstrate how we can
achieve a non-trivial guarantee that a heuristic solution for problems with
hundreds of agents is close to optimal. Furthermore, we provide evidence that
the upper bounds may improve the effectiveness of heuristic influence search,
and discuss further potential applications to multiagent planning.
Comment: Long version of IJCAI 2015 paper (and extended abstract at AAMAS 2015)
Dec-POMDPs as Non-Observable MDPs
A recent insight in the field of decentralized partially observable Markov decision processes (Dec-POMDPs) is that it is possible to convert a Dec-POMDP to a non-observable MDP, which is a special case of a POMDP. This technical report provides an overview of this reduction and pointers to related literature.
Structure in the Value Function of Two-Player Zero-Sum Games of Incomplete Information
Zero-sum stochastic games provide a rich model for competitive decision
making. However, under general forms of state uncertainty as considered in the
Partially Observable Stochastic Game (POSG), such decision making problems are
still not very well understood. This paper makes a contribution to the theory
of zero-sum POSGs by characterizing structure in their value function. In
particular, we introduce a new formulation of the value function for zs-POSGs
as a function of the "plan-time sufficient statistics" (roughly speaking the
information distribution in the POSG), which has the potential to enable
generalization over such information distributions. We further delineate this
generalization capability by proving a structural result on the shape of the
value function: it exhibits concavity and convexity with respect to appropriately
chosen marginals of the statistic space. This result is a key precursor for
developing solution methods that may be able to exploit such structure.
Finally, we show how these results allow us to reduce a zs-POSG to a
"centralized" model with shared observations, thereby transferring results for
the latter, narrower class to games with individual (private) observations.
On the Impossibility of Learning to Cooperate with Adaptive Partner Strategies in Repeated Games
Learning to cooperate with other agents is challenging when those agents also
possess the ability to adapt to our own behavior. Practical and theoretical
approaches to learning in cooperative settings typically assume that other
agents' behaviors are stationary, or else make very specific assumptions about
other agents' learning processes. The goal of this work is to understand
whether we can reliably learn to cooperate with other agents without such
restrictive assumptions, which are unlikely to hold in real-world applications.
Our main contribution is a set of impossibility results, which show that no
learning algorithm can reliably learn to cooperate with all possible adaptive
partners in a repeated matrix game, even if that partner is guaranteed to
cooperate with some stationary strategy. Motivated by these results, we then
discuss potential alternative assumptions which capture the idea that an
adaptive partner will only adapt rationally to our behavior.
Comment: 9 pages, to be published in The Proceedings of the 39th International
Conference on Machine Learning, 202